Regression diagnostics
This post covers simple linear regression from Statistics for Engineers and Scientists by William Navidi.
Basic Ideas
Interpreting the Slope of the Least-Squares Line
- If the $x$-values of two points on a line differ by $1$, their $y$-values will differ by an amount equal to the slope of the line.
- If the values of the explanatory variable for two individuals differ by $1$, their predicted values will differ by $\hat \beta_1$.
- If the values of the explanatory variable differ by an amount $d$, then their predicted values will differ by $\hat \beta_1 d$, as the sketch below illustrates.
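As a quick check, here is a minimal sketch of this, assuming hypothetical data and NumPy; the data values and the difference $d$ are made up for illustration:

```python
import numpy as np

# Hypothetical data: load (lb) vs. spring length (in.)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 5.3, 5.5, 5.7, 5.9, 6.1])

# Least-squares estimates; polyfit returns (slope, intercept) for deg=1
beta1, beta0 = np.polyfit(x, y, deg=1)

d = 2.5  # any difference in the explanatory variable
pred_a = beta0 + beta1 * 1.0
pred_b = beta0 + beta1 * (1.0 + d)

# The predicted values differ by beta1 * d
print(pred_b - pred_a, beta1 * d)  # equal, up to floating-point error
```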
The Estimates Are Not the Same as the True Values
- It is important to understand the difference between the least-squares estimates $ \hat \beta_0$ and $ \hat \beta_1$, and the true values $\beta_0$ and $\beta_1$.
- The true values are constants whose values are unknown.
- The estimates are quantities that are computed from the data. We may use the estimates as approximations for the true values.
- Because the data vary from experiment to experiment, so do the estimates computed from them; therefore $\hat \beta_0$ and $\hat \beta_1$ are random variables.
- To make full use of these estimates, we will need to be able to compute their standard deviations.
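To see the estimates behave as random variables, here is a minimal simulation sketch; the true values $\beta_0 = 5$, $\beta_1 = 0.2$, and the error standard deviation are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma = 5.0, 0.2, 0.1  # assumed true values
x = np.linspace(0, 5, 20)

estimates = []
for _ in range(1000):  # repeat the "experiment" many times
    y = beta0_true + beta1_true * x + rng.normal(0, sigma, size=x.size)
    b1, b0 = np.polyfit(x, y, deg=1)
    estimates.append((b0, b1))

est = np.array(estimates)
# The estimates vary from experiment to experiment;
# their standard deviations quantify that variation.
print(est.mean(axis=0))  # close to (5.0, 0.2)
print(est.std(axis=0))   # empirical standard deviations of the estimates
```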
The Residuals Are Not the Same as the Errors
- The residuals $e_i = y_i - \hat y_i$ are computed from the fitted line, so they can be calculated from the data. The errors $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ involve the true coefficients, so they cannot be observed.
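Continuing the simulated setup above (where we know the true values, which we never do in practice), a brief sketch contrasting the two:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0_true, beta1_true = 5.0, 0.2  # assumed true values, unknown in practice
x = np.linspace(0, 5, 20)
eps = rng.normal(0, 0.1, size=x.size)  # the errors (unobservable)
y = beta0_true + beta1_true * x + eps

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)  # computed from the fitted line

# The residuals approximate the errors but are not equal to them,
# because b0, b1 differ from the true beta0, beta1.
print(np.max(np.abs(residuals - eps)))
```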
Don’t Extrapolate Outside the Range of the Data
- For many variables, linear relationships hold within a certain range, but not outside it.
- If we extrapolate a least-squares line outside the range of the data, there is therefore no guarantee that it will properly describe the relationship.
- For example, if we fit a line to measurements of spring length versus load, and we want to know how the spring will respond to a load of 100 lb, we must include weights of 100 lb or more in the data set.
Don’t Use the Least-Squares Line When the Data Aren’t Linear
When the scatterplot follows a curved pattern, it does not make sense to summarize
it with a straight line.
Measuring Goodness-of-Fit
A goodness-of-fit statistic is a quantity that measures how well a model explains a given
set of data.
The correlation coefficient $r$ is a goodness-of-fit statistic for the linear model.
For example, the points on a scatterplot might be $(x_i, y_i)$, where $x_i$ is the height of the $i$th man and $y_i$ is the length of his forearm. Then

$$r^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

The sums of squares appearing in this discussion are used so often that statisticians have given them names. They call $\sum_{i=1}^{n} (y_i - \hat y_i)^2$ the error sum of squares and $\sum_{i=1}^{n} (y_i - \bar y)^2$ the total sum of squares. Their difference

$$\sum_{i=1}^{n} (y_i - \bar y)^2 - \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

is called the regression sum of squares.
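As an illustrative sketch with hypothetical data, the sums of squares and $r^2$ can be computed directly, and $r^2$ agrees with the squared correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # error sum of squares
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
ssr = sst - sse                    # regression sum of squares

r_squared = ssr / sst
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)  # the two agree
```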